ABSTRACT

Monte Carlo simulation procedures were used to study
the psychometric characteristics of two two-stage adaptive tests and
a conventional "peaked" ability test. Results showed that scores
yielded by both two-stage tests better reflected the normal
distribution of underlying ability. Ability estimates yielded by one
of the two-stage tests were more reliable and had a slightly higher
relationship to underlying ability than did the conventional test
scores. One of the two-stage tests yielded an approximately
horizontal information function, indicating more constant precision
of measurement for individuals at all levels of ability. The
conventional test and the second two-stage test yielded information
functions peaked at the mean ability level but dropping off at more
extreme levels of ability; however, the second two-stage test
provided more information than the conventional test at all levels of
ability. The findings of the study were interpreted as indicating the
potential superiority of two-stage tests in comparison to
conventional tests. Several improvements in the construction of
two-stage tests are suggested for use in further research.
(Author)
SIMULATION STUDIES OF TWO-STAGE ABILITY TESTING



A promising new approach to the measurement of abilities 
has been made possible by the growth and refinement of time- 
shared computer facilities. This approach involves varying 
test item difficulty during the testing procedure according to
the estimated ability of the examinee and has been called
tailored (Lord, 1970) or adaptive (Weiss & Betz, 1973) testing.

Two-stage testing is one approach to the implementation of
adaptive testing procedures. The first stage of a two-stage
testing strategy consists of a short "routing" test which is
used to obtain a rough initial estimate of the testee's ability.
Using this estimate, the testee is then "routed" to a longer
second-stage or "measurement" test which consists of items close
to his/her estimated ability level. The purpose, then, of
two-stage testing is to enable the assignment of each individual to
the measurement test most appropriate to his/her ability.
Cronbach and Gleser (1965) were the first to suggest the use of
two-stage testing procedures. Weiss (1974) describes several
variations of the basic two-stage strategy and compares them
with other strategies of adaptive ability testing.

The first reported study of the two-stage procedure was an
empirical study by Angoff and Huddleston (1958). Their routing
tests were not actually used to assign individuals to measurement
tests; rather, measurement tests were embedded within a
large sample of items administered to all testees, and the
performance of individuals was evaluated on those measurement
test items they would have received had routing occurred.
Results showed that the measurement tests were more reliable in
the sub-groups for which they were intended than were conventional
tests measuring a broader range of ability. Predictive validities
of the measurement tests, using grade-point average as the
criterion, were slightly higher than those of the conventional
tests. Their data also showed, however, that 20% of the testees
would have been misclassified, or routed into an inappropriate
measurement test, on the basis of their routing test score.¹

A series of "real data" simulation studies of two-stage
testing was reported by Cleary, Linn, and Rock (1968a,b; Linn,
Rock, & Cleary, 1969). In these studies, the responses of 4,885
students to the 190 verbal items of the School and College
Aptitude Tests and the Sequential Tests of Educational Progress
were used to simulate four variations of the two-stage testing
strategy.



¹Further information concerning the details of this study and the
remaining studies to be discussed may be found in Betz and
Weiss (1973) and Weiss and Betz (1973).
Correlations between the artificial two-stage test scores
(based on a maximum of 43 items) and scores on the 190-item parent
test were almost as high as the reliability estimates of the
parent test. In some cases these correlations were higher than
the correlations between the parent test and shortened conventional
tests using more items than were used in the two-stage tests. The
best short conventional test was found to require about 35% more
items to achieve the same level of accuracy provided by the
two-stage test, and it was concluded that two-stage tests can permit
large reductions in the number of items necessary to obtain
accurate estimates of ability.

Even more favorable were the findings that the majority of
the artificial two-stage tests had higher predictive validities
(using scores on the College Entrance Examination Board Tests
and the Preliminary Scholastic Aptitude Tests as criteria) than
did the conventional tests of the same length. The best two-stage
tests had higher validities than longer conventional tests,
including the 190-item parent test. These results demonstrated
that two-stage tests can achieve high predictive accuracy with
substantially fewer items than would be necessary in a
conventional test, although the data of Cleary et al., like those of
Angoff and Huddleston, showed a misclassification rate of about 20%.

A series of theoretical studies of two-stage testing was
reported by Lord (1971c). His analyses were based on the
mathematics and assumptions of item characteristic curve theory
(Lord & Novick, 1968), including the assumption that the probability
of a correct response to an item is a normal ogive function of
underlying or latent ability. All items were assumed to be of
equal discriminating power, and the items within the routing
tests or any one of the measurement tests were assumed to be of
equal difficulty.

Lord (1971c) compared the two-stage tests with conventional
tests (i.e., tests in which all examinees receive the same items
in the same order). However, Lord's conventional tests
represented a theoretical ideal in that they were assumed to be
perfectly peaked (i.e., all items in a test are of equal
difficulty) at the mean ability level of the hypothetical population
under study. As in the two-stage tests, all items were also
assumed to have equal discriminations. Lord compared the
two-stage and conventional tests in terms of information functions,
which indicate the relative precision of measurement at various
points along the ability continuum. Precision can be defined as
the capability of scores based on responses to a set of test
items to accurately represent the "true ability" of individuals;
the greater the precision at a particular level of ability, the
smaller the standard error of measurement and the confidence
interval in estimating true ability at that point.

Lord found that the conventional test provided more precise
measurement for ability levels near the group mean, but that the
two-stage procedures provided increasingly better measurement
relative to the conventional test with increased divergence
from the mean ability level. The finding that the peaked
conventional test provided better measurement around the mean
ability level has been supported by Lord's other theoretical
studies comparing peaked ability tests with tests "administered"
by pyramidal and flexilevel adaptive testing strategies (Lord,
1970, 1971a,b); thus, the peaked test always provided more
precise measurement than the adaptive test when ability was at
the point at which the test was peaked. However, as an individual's
ability deviated from the average, the peaked test provided less
precise measurement, and the adaptive test provided more precise
measurement than did the conventional peaked test.

Figure 1 presents a hypothetical illustration of how the 
comparative precision or measurement efficiency of conventional 
and adaptive tests would appear if the values of information at 
various levels of ability were connected to form a smooth 
curve. The figure shows that while the conventional peaked test 
provides superior measurement around the mean ability level, 
the efficiency of the adaptive tests is more constant across the 
range of ability and becomes greater than that of the conventional 
test beyond a given interval containing the mean ability level. 

The importance of these findings is that they indicate that
the most precise or accurate measurement for any individual will
be obtained by administering to him/her a test peaked at a
difficulty level equal to that individual's ability level.
Thus, test items should be of median, or p = .50, difficulty
for each individual, rather than of median difficulty for a
group of individuals varying in ability.
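
The point that an item measures most precisely when its difficulty
matches the individual's ability can be illustrated with the standard
item information function, I(theta) = P'(theta)^2 / [P(theta) Q(theta)].
The sketch below is only an illustration under stated assumptions: it
uses a normal ogive item with no guessing (c = 0), a hypothetical
discrimination of .7 and difficulty of .5, and a function name that does
not come from the original study.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, sd 1


def item_information(theta, a, b):
    """Information of one normal ogive item, assuming no guessing (c = 0).

    Standard result: I(theta) = P'(theta)**2 / (P * Q), where
    P = Phi(a * (theta - b)) and P' = a * phi(a * (theta - b)).
    """
    z = a * (theta - b)
    p = _STD_NORMAL.cdf(z)
    slope = a * _STD_NORMAL.pdf(z)
    return slope ** 2 / (p * (1.0 - p))


# Hypothetical item (values chosen for illustration only).
A_ITEM, B_ITEM = 0.7, 0.5

# Search a grid of ability levels for the point of maximum information.
grid = [round(-3.0 + 0.1 * k, 1) for k in range(61)]
best_theta = max(grid, key=lambda t: item_information(t, A_ITEM, B_ITEM))
print(best_theta)  # ability level at which this item is most informative
```

Under the no-guessing assumption the maximum falls where ability equals
item difficulty, that is, where p = .50 for that individual. When
guessing is possible the optimum shifts toward somewhat easier items.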

An attempt to verify Lord's findings, by routing each
individual to that measurement test containing items peaked at
median difficulty for him or her, was made in an empirical
study of two-stage testing reported by Betz and Weiss (1973).
This was the first study to employ computer-administration of
test items and computer-controlled routing to the appropriate
measurement test within the two-stage paradigm. Each examinee
was administered a two-stage test, consisting of a 10-item
routing test and one of four 30-item measurement tests, and a
40-item conventional test containing items peaked at the median
ability level of the group. The tests were readministered after
an interval averaging 5 to 6 weeks in length so that estimates
of the test-retest stability could be made.

Results showed that the routing test had as high an internal
consistency reliability as did the conventional test, but in
contrast to Angoff and Huddleston's (1958) findings, the
measurement tests were less reliable than was the conventional
test. However, the restriction in ability range caused by the
routing procedure would be expected to depress internal
consistency reliability. The overall test-retest stability of
the two-stage test (.88) was as high as that of the conventional
test (.89) and was higher (.9?) when calculated for only those
individuals who had received the same measurement test on the
first and second administrations (thus receiving the same
opportunity for memory of the previous item responses as was the
case with the conventional test).

The routing procedure misclassified only 4% of the testees
and was thus an improvement on the 20% rates found in previous
studies (Angoff & Huddleston, 1958; Cleary et al., 1968a,b).
However, it was also found that the measurement tests were not
of optimal difficulty for the groups of individuals assigned
to them.



Thus, the studies to date of two-stage testing have shown
that it has the potential of providing greater accuracy of
measurement and greater predictive validity using fewer items
than is possible with conventional tests. However, each of
these studies has had limitations which have restricted the
generalizability or usefulness of the obtained results. The
generalizability of Lord's (1971c) results is limited by the
assumption of "ideal" items. Angoff and Huddleston's empirical
and Cleary et al.'s "real data" simulation studies are limited
by the fact that actual routing did not occur. In the empirical
study of Betz and Weiss (1973), the small sample size (N = 214)
and the lack of a criterion of "true" ability level prevented
the calculation of the relative information or precision of
measurement provided by the two-stage and conventional tests.

The present study is, therefore, an attempt to examine the
generalizability of the previous findings using Monte Carlo
simulation studies of responses to real test items. Monte Carlo
studies offer several advantages over other methods of
investigating adaptive testing procedures. First, because large
numbers of testee "records" can be simulated relatively quickly,
it is possible to derive parametric estimates of the characteristics
of scores yielded by various testing strategies. These estimates
are based on sample sizes sufficiently large to ensure their
representativeness. Second, the availability of an ability
criterion permits the derivation of information functions and
the calculation of their values at points along the hypothetical
ability continuum. Third, Monte Carlo simulation studies
utilizing two-stage tests composed of items previously administered
in empirical studies (e.g., Betz & Weiss, 1973) make it
possible to determine whether empirical and simulation studies
lead to similar conclusions. Finally, should the results of
simulation studies mirror those of the empirical studies, thus
validating the simulation model, Monte Carlo methods can then be
used to rapidly identify good designs for adaptive testing
by providing data concerning the effects of variations in the
characteristics of the adaptive testing strategies.
Within the framework of a two-stage adaptive strategy, some
of the characteristics which may be varied include: 1) the total
number of items given to a single examinee; 2) the number of
items in the routing test; 3) the difficulty level of the routing
test; 4) the distribution of item difficulties in the routing
test; 5) the number of alternative measurement tests available;
6) the cutting points for assigning examinees to measurement
tests; 7) the difficulty levels of the measurement tests; and
8) the distributions of item difficulties in the measurement
tests. Empirical studies of the promising designs identified in
the simulation studies can then be used to evaluate their
performance under live testing conditions.



METHOD 

Design

The simulation studies were directed at examining the
characteristics of the two-stage test and determining whether
this testing strategy showed any advantages as compared to
conventional ability testing procedures. The simulation studies
were designed to permit the investigation of 1) the characteristics
of the score distributions yielded by the two-stage and conventional
tests in comparison with that of the known ability distribution;
2) the relationships between ability estimates derived from the
two-stage strategy and the conventional test; 3) the parallel
forms reliability of each test; 4) the relationships between
ability estimates and hypothetical underlying ability as
specified by the simulation program; and 5) the amount of
information or precision provided by each testing strategy at
various points along the ability continuum. The first two
characteristics examined replicated information obtained in the
empirical study (Betz & Weiss, 1973) and were thus considered
important to the generalizability of these findings and to the
validity of the simulation model. In the simulation study,
however, the obtained score distributions could be compared with
the known ability distribution.

The third characteristic, parallel forms reliability, had not
been studied empirically. Rather, the empirical study examined
test-retest stability, or the reliability of the same test over
time intervals. However, empirically determined test-retest
stability includes as systematic true score variance two sources
of error which do not influence simulated test scores. First,
the content of an item may contribute error due to specific
gaps or emphases in the knowledge of a particular individual.
With real subjects, characteristics specific to a particular
item are likely to be stable and would thus be reflected in the
response on both test and retest. Second, memory of responses
made on the first testing may influence the responses of real
subjects to the items in the retest. Simulated readministration
of the same test, which was the procedure in this study, is
equivalent to the administration of two tests with items whose
parameters are identical, and since item content does not
influence a simulated item response, the two tests can be thought
of as perfectly parallel (Gulliksen, 1950). In the present
study, parallel forms reliability was considered to provide a
lower-bound estimate of test-retest stability, because errors
which act to inflate stability were not present, and also a
lower-bound estimate of internal consistency reliability
(Guilford, 1954).

The last two areas of interest (relationships with underlying
ability and information functions) cannot be studied
empirically because underlying ability is not known for real
subjects and because the derivation of information functions
requires inordinately large sample sizes. Thus, simulation
studies make it possible to study important characteristics of
the various testing strategies which cannot be studied using
other research methods.

Two two-stage tests were studied. Two-stage 1 consisted
of the same items that had been administered in the empirical
study (Betz & Weiss, 1973). Two-stage 2 was constructed to
correct the problems of inappropriate difficulty levels and
cutting points that were found in Two-stage 1. The conventional
test studied was the same one used in the empirical study (Betz &
Weiss, 1973). Each two-stage test was "administered" in
conjunction with the conventional test so that the relationships
between the resulting score distributions could be found, and
all tests were administered twice so that parallel form
reliability could be evaluated.

Test administration was simulated for two samples of 
hypothetical testees. One sample consisted of 10,000 testees 
whose ability levels were assigned through random sampling from
a normally distributed population of ability levels. The second 
sample consisted of 1,600 testees, 100 at each of 16 discrete 
ability levels distributed along the ability continuum. This 
distribution of ability levels, which will be referred to as the 
"equal-frequency" distribution, was generated for the sole 
purpose of providing estimates of "information" that were based 
on equal sample sizes at each selected point on the ability 
continuum. 

Thus, the overall design involved simulated test administration
under the following four conditions:

   1. Two-stage 1 and the conventional test, each administered
      twice to 10,000 "examinees" whose ability levels were
      sampled from a normal distribution of ability levels.

   2. Two-stage 2 and the conventional test, each administered
      twice to 10,000 "examinees" whose ability levels were
      sampled (independently of the sample taken in condition 1)
      from the normal distribution of ability levels.

   3. Two-stage 1 and the conventional test, each administered
      twice to 1,600 "examinees" whose ability levels constituted
      an "equal-frequency" distribution.

   4. Two-stage 2 and the conventional test, each administered
      twice to the same sample of "examinees" described in
      condition 3.



Test Construction 

Monte Carlo simulation of test administration does not
involve the actual administration of test items. Rather, it 
uses only the input of the relevant item parameters into a 
formula expressing the relationship between ability level and 
response to an item with given characteristics. The item 
parameters selected for input into the simulation program used 
in this study were those characterizing the items constituting 
tests constructed for administration to real subjects. The 
following section, then, describes the manner in which these 
tests were constructed. 

Item Pool 

The item pool used to construct the empirical two-stage 
and conventional tests consisted of five-alternative multiple 
choice vocabulary items. The items were normed on college 
students, and normal ogive difficulty ("b") and discrimination 
("a") parameters were stored in the computer for each item. 
Details concerning the development and norming of the item pool
are reported by McBride and Weiss (1974).

Two-stage Tests 

Each two-stage test was composed of a 10-item routing
test and four 30-item measurement tests. "Testees" were assigned
to one of the four measurement tests on the basis of their scores
on the routing test. Items within each subtest (e.g., routing
or measurement) were selected to concentrate around a given level
of difficulty. While it was not possible to select perfectly
peaked subtests given the limitations of a real item pool, the
items within each subtest did distribute closely around the
desired "b" (item difficulty) value.

Two-stage 1. In the construction of the first empirical
two-stage test (Two-stage 1), the difficulty level of the routing
test was set to be somewhat easier than the median ability level
of the group to account for the probability of chance success on
an item through random guessing (e.g., .2 given 5-alternative
responses) as suggested by Lord (1952, 1970). The difficulty
levels of the measurement tests were distributed approximately
evenly above and below the routing test difficulty.

It was possible to select very highly discriminating items 
for the routing test because only ten items were required; the 
measurement tests included items of slightly lower discriminating 
power. However, highly discriminating items were considered more 
important in the routing test to ensure the accurate assignment 
of testees to measurement tests. Table 1 presents the mean item 
difficulty and discrimination values for the routing test and 
each measurement test. Both the normal ogive parameters
(difficulty, b, and discrimination, a) and traditional item
parameters (proportion correct, p, and the biserial correlation
with total score, r_b) are presented.

To make assignments to measurement tests, score ranges on
the routing test of 0 through 3, 4 and 5, 6 and 7, and 8 through
10 were used respectively to assign "testees" to the least
difficult through the most difficult measurement tests. The
lowest score range was the widest since it was expected to
include many "chance" scores. Further details on the construction
and characteristics of Two-stage 1 may be found in Betz and
Weiss (1973).

Two-stage 2. The second two-stage test (Two-stage 2) was
constructed to improve on some of the shortcomings of the
original two-stage test. First, the routing test was made
slightly more difficult (mean b = -.23) since the original routing
test (mean b = -.56) had proven too easy for the group as a whole
and had created an imbalance in the assignment to measurement
tests. Second, the difficulties of the measurement tests were
changed in accordance with data concerning the appropriateness
of the difficulty levels of the original tests. An examination
of Table 1, which summarizes the characteristics of the items
of Two-stage 2, shows that in general it was a more difficult
test but with a smaller overall spread of item difficulties.
Tests 3 and 4, the least difficult measurement tests, were made
considerably more difficult than were the corresponding
measurement tests in Two-stage 1. And, while the routing test items
were as discriminating as those in Two-stage 1, the measurement
test items were on the whole somewhat more discriminating.
Appendix A gives item reference numbers (see McBride & Weiss,
1974) and difficulty and discrimination values for each item of
both two-stage tests.

The routing test score intervals used for assignment to
measurement tests in Two-stage 1 were selected on the basis of
essentially logical considerations. To formalize and hopefully
improve the selection of cutting points for measurement tests in
Two-stage 2, the score intervals were determined by calculating
a maximum likelihood estimate of ability for each possible routing
test score (0-10) using the scoring formula described below and
assigning individuals to that measurement test closest in
difficulty to their estimated ability (normal ogive parameter
"b" is on the same scale as the ability estimate and thus a
direct comparison can be made). The resulting score intervals
were 0 through 4, 5 and 6, 7 and 8, and 9-10. Appendix B
contains the ability estimates associated with each possible
routing score and the resulting measurement test assignment.

Table 1

Summary of item characteristics (norming values)
for the two-stage and conventional tests.

                     Number of    Item Difficulty     Item Discrimination
Test                   Items     Mean "b"   Mean p    Mean "a"   Mean r_b

Two-stage 1
  Routing                10        -.56       .62        .71        .57
  Measurement
    1                    30        1.81       .24        .47        .42
    2                    30         .22        --        .52        .44
    3                    30       -1.34       .73        .53        .46
    4                    30       -2.62       .89        .63        .51
    Mean                           -.49       .58        .55        .47

Two-stage 2
  Routing                10        -.23       .55        .70        .5?
  Measurement
    1                    30        1.73       .23        .53        .46
    2                    30         .35       .43        .68        .55
    3                    30        -.71       .64        .61        .52
    4                    30       -1.60       .80        .68        .55
    Mean                           -.07       .53        .63        .52

Conventional test        40        -.33       .56        .54        .47
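
The cutting-point procedure for Two-stage 2 can be sketched as follows.
This is a minimal illustration, not the original program: it applies the
scoring formula from the Scoring section (equation 2, with its
substitutions for perfect and chance-level scores) to each routing score
and picks the measurement test with the nearest mean difficulty. The
function names are hypothetical, and the inputs are the rounded Table 1
means, so the original calculation, which may have used unrounded
item-level values, could differ slightly.

```python
from statistics import NormalDist

_INV_PHI = NormalDist().inv_cdf  # inverse of the standard normal CDF


def ability_estimate(x, m, a_bar, b_bar, c=0.2):
    """Equation 2 estimate from a number-correct score x on an m-item subtest."""
    if x >= m:            # perfect score: replace x by m - .5
        x = m - 0.5
    elif x <= c * m:      # chance-level score or below: replace x by cm + .5
        x = c * m + 0.5
    return _INV_PHI((x / m - c) / (1.0 - c)) / a_bar + b_bar


# Rounded Table 1 means for Two-stage 2 (assumed inputs for this sketch).
ROUTING = {"m": 10, "a_bar": 0.70, "b_bar": -0.23}
MEASUREMENT_B = {1: 1.73, 2: 0.35, 3: -0.71, 4: -1.60}


def assign_measurement_test(routing_score):
    """Return the measurement test whose mean difficulty is closest to the estimate."""
    theta = ability_estimate(routing_score, **ROUTING)
    return min(MEASUREMENT_B, key=lambda t: abs(MEASUREMENT_B[t] - theta))


for score in range(11):
    print(score, round(ability_estimate(score, **ROUTING), 2),
          assign_measurement_test(score))
```

With these rounded inputs the sketch sends routing scores 0 through 4 to
measurement test 4 (the easiest), 5 and 6 to test 3, 7 and 8 to test 2,
and 9 and 10 to test 1 (the most difficult).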



Scoring. Two-stage tests cannot be scored using a simple
number-correct score since examinees take different measurement
tests having different difficulty levels. The method of scoring
two-stage tests suggested by Lord (1971c) takes both the number
correct and the difficulty level of the items into account. It
consists of obtaining two maximum likelihood estimates of
ability ($\theta$), one from the routing test ($\hat{\theta}_1$) and one from the
appropriate measurement test ($\hat{\theta}_2$). These two estimates are then
averaged after weighting them inversely according to their
estimated variances.



The formula used by Lord to obtain estimates of $\theta$ from the
routing and measurement tests was as follows:

$$\hat{\theta} = \frac{1}{a}\,\Phi^{-1}\!\left[\frac{(x/m) - c}{1 - c}\right] + b \tag{1}$$

where a is the normal ogive discrimination value
          of the items;
      x is the number correct;
      m is the total number of items administered
          in that subtest;
      c is the chance-score level;
      b is the normal ogive difficulty level of
          the items in the subtest;
      $\Phi^{-1}$ (the inverse of $\Phi$) is the relative deviate
          corresponding to a given normal curve area.



In the present study, equation 1 was modified slightly to
account for the fact that the items in any given subtest were
not all of equal discrimination and difficulty. The formula
used was as follows:

$$\hat{\theta} = \frac{1}{\bar{a}}\,\Phi^{-1}\!\left[\frac{(x/m) - c}{1 - c}\right] + \bar{b} \tag{2}$$

where $\bar{a}$ represents the mean discrimination value, and
$\bar{b}$ the mean difficulty of the items in that subtest.

The value of c was always .2 since the items had five alternative
responses. Whenever x = m (perfect score) or x = cm (chance score),
$\hat{\theta}$ cannot be determined. Therefore, when x was equal to m, it was
replaced by x = m - .5 = 9.5, and when x was less than or equal to cm, it
was replaced by x = cm + .5 = 2.5 (the values shown are those for the
10-item routing test, where cm = 2).

The two ability estimates $\hat{\theta}_1$ and $\hat{\theta}_2$, from the routing and
measurement test respectively, as computed from equation 2, were
combined into a total-test ability estimate by averaging the two
after weighting each by the number of items (10 or 30) on which it
was based. This method of weighting was used instead of the
variance weights used by Lord (1971c) since the latter method
was found to have some disadvantageous characteristics (see
Betz & Weiss, 1973). The composite ability estimate, then, was
defined by the following equation:

$$\hat{\theta} = \frac{10\,\hat{\theta}_1 + 30\,\hat{\theta}_2}{40} = \frac{\hat{\theta}_1 + 3\,\hat{\theta}_2}{4} \tag{3}$$

Scores determined in this way can be interpreted similarly to
standard normal deviates, i.e., they have a mean of 0 and a
variance of 1.
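
The item-count weighting of equation 3 amounts to a weighted mean of the
two subtest estimates. The short sketch below shows only that step; the
function name and the two input estimates are hypothetical values chosen
for illustration, not results from the study.

```python
def composite_estimate(theta_routing, theta_measurement,
                       n_routing=10, n_measurement=30):
    """Weight each subtest estimate by the number of items it is based on."""
    total_items = n_routing + n_measurement
    return (n_routing * theta_routing
            + n_measurement * theta_measurement) / total_items


# Hypothetical subtest estimates for one simulated examinee.
theta_1 = 0.2   # from the 10-item routing test
theta_2 = 0.6   # from the 30-item measurement test
print(round(composite_estimate(theta_1, theta_2), 3))
```

Because the measurement test contributes three times as many items, its
estimate carries three times the weight of the routing estimate, which
is the simplified form on the right-hand side of equation 3.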

Conventional Test

The conventional test consisted of 40 items. As in construction
of the two-stage tests, the use of a real item pool did not
permit the construction of a perfectly peaked or equidiscriminating
test as had been studied by Lord (1971c). Item difficulties were
concentrated around a "b" value of -.33 (again somewhat easier
than the median ability level of the group). While the range of
difficulties was large for a peaked test, it was small in relation
to the range of difficulties covered by all four of the
second-stage measurement tests used in either Two-stage 1 or Two-stage 2.
Table 1 also summarizes the characteristics of the 40-item
conventional test. Appendix A gives difficulty and discrimination
values for each item of the conventional test. Additional details
on the construction of this test may be found in Betz and Weiss
(1973, p. 13). Number correct was used as the score on the
conventional test.

Simulation of Test Responses

The Simulation Model 

Development of the simulation procedure was based on the
assumptions and mathematics of item characteristic curve theory
(Lord & Novick, 1968). Using the mathematical model suggested
by Lord (1970, 1971c), the probability of a correct response
to an item was assumed to be a generalized normal ogive function
of the examinee's ability and was determined through the solution
of the following equation:

$$P_i(\theta) = c_i + (1 - c_i)\,\Phi[a_i(\theta - b_i)] \tag{4}$$

In this formula, $P_i(\theta)$ is the probability that an examinee with
ability $\theta$ will respond correctly to item i. The $a_i$, $b_i$, and $c_i$
are the normal ogive parameters of item i, where $a_i$ represents
the discriminating power of the item, $b_i$ represents the
difficulty level of the item, $c_i$ is the guessing parameter, or the
probability that the item can be answered correctly through
random guessing, and $\Phi[a_i(\theta - b_i)]$, the cumulative normal
distribution function, represents the normal distribution cumulative
proportion up to the relative deviate $a_i(\theta - b_i)$.



In the solution of equation 4, $c_i$ was set at .2 since the
item pool was based on five-alternative multiple-choice items.
Difficulty and discrimination parameters were those associated
with each item administered (see Appendix A), and ability level
was specified as described below.
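
The response model described above, a normal ogive with a guessing
parameter, can be sketched in a few lines. This is an illustrative
sketch only, not the program described in the appendix of the report:
it assumes the form P = c + (1 - c) * Phi[a(theta - b)], and the
function names are hypothetical.

```python
import math


def normal_cdf(z: float) -> float:
    """Cumulative standard normal distribution, Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def p_correct(theta: float, a: float, b: float, c: float = 0.2) -> float:
    """Probability of a correct response under the normal ogive model.

    theta: examinee ability
    a:     item discrimination
    b:     item difficulty
    c:     guessing parameter (.2 for five-alternative items)
    """
    return c + (1.0 - c) * normal_cdf(a * (theta - b))


# An examinee whose ability equals the item difficulty answers
# correctly with probability c + (1 - c) / 2.
p_at_difficulty = p_correct(theta=0.0, a=1.0, b=0.0)
```

With c = .2 the probability never falls below .2, however low the
ability, which is the sense in which c represents random guessing.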

Procedure 

Appendix C describes the computer program which simulated
test administration and calculated test scores. The data yielded
by the program consisted of ability level and four ability
estimates (two each from the two-stage and conventional tests)
for each hypothetical subject. Each "run" of the program provided
data for 100 hypothetical individuals; following each "run,"
the Pearson product-moment correlations among test scores and
between test scores and underlying ability were calculated for
that group of 100 "testees."

Underlying ability level was specified in two ways. To
obtain a subject population with a normal distribution of
abilities, a pseudo-random number generator yielding a normally
distributed set of numbers with a mean of 0 and a variance of 1
was used to assign an ability level to each of 10,000 hypo-
thetical individuals. To obtain the "equal-frequency"
distribution of ability, each of 16 ability levels between
θ = -3.2 and θ = +3.2 was assigned to 100 individuals. The 16
ability levels used are shown in Table 7.
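
The two ways of specifying ability can be illustrated as follows.
This is a sketch under stated assumptions: Python's `random.gauss`
stands in for the pseudo-random normal generator used in the study
(which is not identified here), the function names are hypothetical,
and the 16 levels are those listed in the ability column of Table 7.

```python
import random
import statistics

# The 16 ability levels of the "equal-frequency" distribution (Table 7).
ABILITY_LEVELS = [3.2, 3.0, 2.5, 2.0, 1.5, 1.0, 0.5, 0.1,
                  -0.1, -0.5, -1.0, -1.5, -2.0, -2.5, -3.0, -3.2]


def normal_abilities(n: int, rng: random.Random) -> list[float]:
    """Draw n ability levels from a normal distribution (mean 0, variance 1)."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def equal_frequency_abilities(per_level: int = 100) -> list[float]:
    """Assign each of the 16 fixed ability levels to per_level individuals."""
    return [level for level in ABILITY_LEVELS for _ in range(per_level)]


rng = random.Random(1975)  # arbitrary seed, chosen only for repeatability
normal_group = normal_abilities(10_000, rng)
equal_group = equal_frequency_abilities()
group_mean = statistics.mean(normal_group)
```

The equal-frequency group contains 16 x 100 = 1,600 hypothetical
individuals.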

Once ability level had been specified, item "administration"
was begun. The parameters of the particular item to be ad-
ministered were entered, along with the ability level, into
equation 4 to calculate the probability ($P_i(\theta)$) of a
correct response to that item. Following the calculation of
$P_i(\theta)$, a random number P from a rectangular distribution of
real numbers between 0 and 1 was generated. If P ≤ $P_i(\theta)$, the
item was scored "1" (correct), and if P > $P_i(\theta)$ the item was scored
"0" (incorrect). The item response, 1 or 0, was then stored in
the computer for use in scoring the test.
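
The item "administration" step amounts to comparing a uniform random
number with the model probability. The sketch below is illustrative
only: the item parameters are hypothetical (the actual values are in
Appendix A), the function names are assumptions, and a uniform draw
exactly equal to the probability is scored as correct.

```python
import math
import random


def p_correct(theta: float, a: float, b: float, c: float = 0.2) -> float:
    """Normal ogive probability of a correct response with guessing."""
    phi = 0.5 * (1.0 + math.erf(a * (theta - b) / math.sqrt(2.0)))
    return c + (1.0 - c) * phi


def simulate_response(theta, a, b, rng, c=0.2) -> int:
    """Score one item: 1 (correct) if the uniform draw does not exceed P."""
    p_i = p_correct(theta, a, b, c)
    u = rng.random()  # rectangular (uniform) draw on [0, 1)
    return 1 if u <= p_i else 0


def simulate_test(theta, items, rng, c=0.2) -> list[int]:
    """Administer a list of (a, b) items and return the 0/1 responses."""
    return [simulate_response(theta, a, b, rng, c) for a, b in items]


# Hypothetical three-item test, for illustration only.
hypothetical_items = [(1.0, -1.0), (0.8, 0.0), (1.2, 1.0)]
responses = simulate_test(0.5, hypothetical_items, random.Random(42))
number_correct = sum(responses)  # number-correct score
```

Summing the stored responses gives the number-correct score used for
the conventional test and for the routing test.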

In the conventional test, the items were administered in
the order shown in the corresponding Appendix table. In the
administration of the two-stage tests, the routing test score
(number correct of the first ten items administered) was
calculated and the next thirty items administered were those
constituting the appropriate measurement test, using the routing
rules described previously for each of the two-stage tests.

Analysis of Data

The basic set of data to be analyzed consisted of ability
level, two scores on Two-stage 1, and two scores from the
conventional test for each of 10,000 "testees." For the second
group of 10,000 "testees," the data consisted of ability level,
two Two-stage 2 scores, and two conventional test scores.
Analysis of the former data set was designed to replicate the
analyses of the live-testing study reported by Betz & Weiss (1973)
using the same two-stage test (Two-stage 1) and the same
conventional test.

While it was assumed that samples of 10,000 ability levels 
generated from a normally distributed population would be 
normally distributed, the characteristics of the two resulting 
distributions of ability were analyzed to determine whether or 
not this assumption was valid. For each distribution of 10,000 
ability levels, the mean, variance, and the degrees of skewness 
and kurtosis were calculated. The degrees of skewness and 
kurtosis were tested for the significance of their departure 
from normality (McNemar, 1969, pp. 25-29 and 87-88). Both
distributions of ability were found to be normal. The means
were 0.0, and the variances were 1.0. The degree of skewness
was .010 for both distributions (as compared to the standard
error of .025 given an N of 10,000). The degree of kurtosis was
-.003 for the first distribution and -.0[illegible] for the second
distribution (as compared to a standard error of .05).
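
The standard errors quoted above are close to what the usual
large-sample approximations give for N = 10,000. The sketch below
assumes moment-based indices of skewness and kurtosis and the
approximations sqrt(6/N) and sqrt(24/N); whether these are exactly the
formulas taken from McNemar (1969) is an assumption.

```python
import math


def skewness_and_kurtosis(values):
    """Moment-based skewness and excess kurtosis of a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2 - 3.0  # 0 for a normal distribution
    return skew, kurt


def standard_errors(n: int):
    """Large-sample standard errors of skewness and kurtosis."""
    return math.sqrt(6.0 / n), math.sqrt(24.0 / n)


se_skew, se_kurt = standard_errors(10_000)
# se_skew is about .024 and se_kurt about .049, in line with the
# .025 and .05 reported for samples of 10,000.
```

An observed index is then judged against its standard error; a
skewness of .010 is well under one standard error of about .025.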

A second set of data consisted of ability level and the
same two sets of four scores as described above for 1,600
"testees," 100 at each of 16 ability levels. These data were used
only in the calculation of values of the information functions
at each of the 16 ability levels, while the data obtained from
the two groups of 10,000 "testees" were used in all analyses to
be described.

Characteristics of Score Distributions 

Analyses of the characteristics of the score distributions
were done separately for the two administrations (test and
retest) of each test. For each distribution of 10,000 scores,
the score mean, standard deviation, and the degrees of skewness
and kurtosis were calculated; the degrees of skewness and
kurtosis were tested for the significance of their departure from
normality (McNemar, 1969, pp. 25-28 and 87-88).

Parallel Forms Reliability

Pearson product-moment correlation coefficients were
calculated to express the degree of relationship between scores
obtained from the two administrations of each testing strategy
for each group of 100 individuals. Thus, there were 100
reliability coefficients obtained from 100 samples from a hypo-
thetical population having a normal distribution of underlying
ability. The sampling distribution of these coefficients was
used to make inferences to the expected value of the population
value ρ and to construct confidence intervals within which ρ
could be expected to fall in 95% of such sampling experiments.
The expected value of ρ was taken as the mean of the distribution
of 100 r values, and 95% confidence intervals were obtained by
adding and subtracting from the expected value a value equal to
two standard deviations of the obtained sampling distribution.
Fisher's z-transformation was applied to each sample value of r,
and the sampling distribution of values of $z_r$ was also obtained.
Confidence intervals were then calculated using ±2 standard
deviations of this distribution, and the resulting values were
transformed into their corresponding values of r.
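
The second interval procedure described above can be sketched as
follows. The sample r values in the example are made up for
illustration, the function name is hypothetical, and the use of the
sample (n - 1) standard deviation is an assumption, since the variant
used in the study is not stated.

```python
import math
import statistics


def interval_via_fisher_z(r_values):
    """Mean and +/- 2 SD interval computed on Fisher's z, mapped back to r."""
    z_values = [math.atanh(r) for r in r_values]  # z = atanh(r)
    mean_z = statistics.mean(z_values)
    sd_z = statistics.stdev(z_values)
    lower = math.tanh(mean_z - 2.0 * sd_z)  # back-transform to r
    upper = math.tanh(mean_z + 2.0 * sd_z)
    return lower, math.tanh(mean_z), upper


# Made-up reliability coefficients from a handful of samples.
example_rs = [0.74, 0.79, 0.81, 0.83, 0.77, 0.85]
lower, center, upper = interval_via_fisher_z(example_rs)
```

Because the interval is built on the z scale and then transformed
back, it is asymmetric around the center value on the r scale and can
never extend beyond -1 or +1.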

Interrelationships among Test Scores and between Scores and
Underlying Ability

Product-moment correlation coefficients were calculated
between scores on Two-stage 1 and the conventional test and
between scores on Two-stage 2 and the conventional test; the
total score distributions of 10,000 scores were used in this
analysis. In addition, eta coefficients were calculated for each
total score distribution regressed on the other one, again
using all 10,000 scores obtained from each testing strategy;
tests of curvilinearity were made to determine if there were
non-linear relationships between score distributions.
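
The contrast between the product-moment correlation and the eta
coefficient can be illustrated with a small example. This sketch
assumes the predictor takes discrete values (as number-correct scores
do), so that eta is the square root of the between-group sum of
squares over the total sum of squares; how continuous scores were
grouped in the study is not stated, and the data are made up.

```python
import math
from collections import defaultdict


def pearson_r(xs, ys):
    """Product-moment correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def eta(xs, ys):
    """Correlation ratio for the regression of y on (discrete) x."""
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[x].append(y)
    grand_mean = sum(ys) / len(ys)
    ss_total = sum((y - grand_mean) ** 2 for y in ys)
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                     for g in groups.values())
    return math.sqrt(ss_between / ss_total)


# A purely curvilinear (quadratic) relationship: r is 0 but eta is 1.
xs = [-2, -1, 0, 1, 2] * 2
ys = [x * x for x in xs]
r_value, eta_value = pearson_r(xs, ys), eta(xs, ys)
```

When eta is appreciably larger than the absolute value of r, the
relationship has a non-linear component; when the two are nearly
equal, as in Tables 4 and 5, it is predominantly linear.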

Similar analysis using both Pearson product-moment and eta
correlation coefficients was done to determine the nature and
degree of the relationship between Two-stage 1, Two-stage 2,
and conventional test scores and ability level for all 10,000
subjects. Thus, the values of r obtained using an N of 10,000
provided one estimate of the expected value of ρ in the pop-
ulation. The characteristics of the sampling distributions of
the 100 product-moment coefficients and z-transformed r's
calculated on each group of 100 testees were also calculated and
used to obtain expected values and confidence intervals for
the ρ values.

Information Functions

The information function is used to compare two or more
strategies of testing in terms of the amount of information
(or relative degree of precision of measurement) provided at
different levels on the ability continuum. The value of
information at each level of underlying ability was calculated
using the formula suggested by Birnbaum (1968):

$$I_x(\theta) = \left[ \frac{\frac{\partial}{\partial\theta} E(x \mid \theta)}{\sigma_{x \mid \theta}} \right]^2 \qquad (5)$$

where $I_x(\theta)$ indicates the amount of information provided by
test X, scored in some specific way, at a given level of
underlying ability θ. The numerator in equation 5 is the slope
of the regression of observed test scores on underlying ability
(calculated by solving the equation for the first derivative
for that value of θ), and the denominator is the standard devia-
tion of test scores obtained by testees with ability θ.
This ratio is then squared to obtain $I_x(\theta)$.



The numerator of equation 5 represents the capability of
test scores to differentiate among examinees of different
levels of underlying ability. For example, given examinees at
two levels of ability $\theta_1$ and $\theta_2$ and expected test score values
$x_1$ and $x_2$, the magnitude of the slope

$$\frac{x_2 - x_1}{\theta_2 - \theta_1} \qquad (6)$$

indicates the degree to which the test discriminates these two
ability levels. The denominator of equation 5 is the precision
of measurement at a particular level of ability. The square
root of $I_x(\theta)$ is inversely related to the confidence interval
for estimating underlying ability from observed score (Green,
1970). Thus, a low value of $I_x(\theta)$ indicates a larger confidence
interval and a larger standard error of measurement at a partic-
ular level of ability, and the higher the value of $I_x(\theta)$, the
narrower the confidence interval and the smaller the error of
measurement. Information values are not meaningful in any
absolute sense because they are dependent on the scale used to
measure θ and also on the scoring formula used to determine x,
but information values calculated from two or more strategies
assuming the same θ scale can be directly compared, with larger
values indicating more precise measurement.

The relative amount of information provided by the two-
stage and conventional tests was calculated for both the
normally distributed and "equal-frequency" distributions of
ability. The regression equation relating test score (the
dependent variable) to generated ability (the independent
variable) was calculated from the normal distribution data using
a least squares curve-fitting computer program. The third
degree or cubic polynomial equation generated was used since
higher degree polynomial equations did not significantly reduce
the standard error of estimate of the dependent variable (i.e.,
test score). The slope function for each test was obtained by
taking the first derivative of the third degree polynomial
equation describing its regression on generated ability.

The normal ability distribution was divided into 33 inter-
vals between θ = -3.3 and θ = +3.3. Each interval had a width of
.2, and the midpoint of the interval was used to calculate the
slope of the function at that level of ability. Thus, the
lowest ability interval was θ = -3.3 to θ = -3.1, and θ = -3.2
was taken as its midpoint. For each interval, the variance of
the test scores of individuals whose hypothetical ability level
fell into that interval was calculated.

When the normal distribution of ability was used, however,
the number of individuals within each interval differed at all
points along the ability continuum. That is, since interval
length was constant, large numbers of individuals fell into the
intervals in the middle of the continuum, while the ability
intervals at or near the extremes had considerably fewer indi-
viduals. Thus, information values for extreme ability levels
were less stable than those nearer the middle because the score
variance was more influenced by chance similarities or differ-
ences among scores determined for individuals of approximately
the same ability.

As a result, the "equal-frequency" distribution of ability
was used to obtain information values of equal stability or
reliability at all points along the ability continuum. The
slope value used was that generated from the normal distribution
and was computed at each of the 16 ability levels indicated in
Table 7. Thus, the numerator of the information equation
was the squared slope at each ability level, and the denominator
was the variance of the 100 scores generated at that level.
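
The computation just described, squared slope of the cubic regression
divided by the score variance at a given ability level, can be
sketched as below. The polynomial coefficients and the scores are
hypothetical, the function names are assumptions, and the use of the
population form of the variance is also an assumption.

```python
import statistics


def cubic_slope(coeffs, theta):
    """First derivative of b0 + b1*t + b2*t**2 + b3*t**3 at t = theta."""
    _b0, b1, b2, b3 = coeffs
    return b1 + 2.0 * b2 * theta + 3.0 * b3 * theta ** 2


def information(coeffs, theta, scores_at_theta):
    """Squared regression slope divided by the score variance at theta."""
    slope = cubic_slope(coeffs, theta)
    variance = statistics.pvariance(scores_at_theta)
    return slope ** 2 / variance


# Hypothetical regression of test score on ability, and hypothetical
# scores of a few testees who all share ability level theta = 1.0.
hypothetical_coeffs = (0.0, 1.0, 0.5, 0.1)
hypothetical_scores = [1.2, 1.9, 1.5, 2.2, 1.7]
info_at_1 = information(hypothetical_coeffs, 1.0, hypothetical_scores)
```

A steep regression slope raises the information value, while a large
spread of scores among testees of equal ability lowers it.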

Since each test was administered twice to each sample of
"testees," there were two sets of information values for each
test. These values were averaged to obtain an overall index
of information at each ability level for each test. Finally,
the mean and standard deviation of each set of 33 information
values (obtained from the normal distribution of ability) and
16 values (obtained from the "equal-frequency" distribution)
were calculated. The mean of each set was interpreted as an
index of the general level of information provided by each
test, while the standard deviation was considered to provide
an indication of the constancy of information provided for
the entire range of ability levels sampled.

RESULTS

Score Distributions

Table 2 presents data describing the distributions of
scores obtained from the two-stage tests and the conventional
test. Data are presented for both administrations (test and
retest) of each test. Since the data derived from administration
of the conventional test with Two-stage 2 were identical to
those of the test when administered with Two-stage 1, only the
latter set of results is presented. These values can be
considered as representative indicators of the characteristics
of the conventional test used in this series of studies.

Two-stage 2, the improved two-stage test, resulted in a
distribution of scores which better reflected the underlying
distribution of ability (normal with mean 0 and variance 1)
than did Two-stage 1. The mean score on Two-stage 2 was
essentially 0, and the standard deviations (1.06 and 1.05)
were closer to 1.0 than those of Two-stage 1 (1.2[illegible] and 1.22).
The skewness of the Two-stage 2 score distribution did not
show a significant departure from normality, while the dis-
tribution of Two-stage 1 scores was significantly skewed in the
negative direction. While the Two-stage 2 distribution was
significantly more platykurtic (flat) than a normal distribution,
the degrees of kurtosis (-.20 and -.23) were less than those
of Two-stage 1 (-.[illegible]2 and -.49).

Both two-stage tests showed less skewness than did the
conventional test, in which scores were significantly
negatively skewed (-.25 and -.23). The conventional test score
distribution was also platykurtic, to about the same degree as
that of Two-stage 1 and to a greater degree than that shown
by Two-stage 2. Thus, the score distribution yielded by Two-
stage 2 better reflected the underlying normal distribution
of ability than did the conventional test.

The score distributions yielded by Two-stage 1 and the
conventional test in the empirical study (Betz & Weiss, 1973)
were not skewed; both distributions, however, tended toward
platykurtosis, and this tendency was statistically significant
in the conventional test scores in the empirical study.

Parallel Forms Reliability 

Table 3 presents the characteristics of the sampling
distribution of parallel forms reliability coefficients. Again,
the results from the conventional test were identical for the
administrations with Two-stage 1 and Two-stage 2, so only one
set of data is presented for the conventional test.

Table 3 shows that Two-stage 2 was more reliable (r=.83)
than either the conventional test (r=.80) or Two-stage 1
(r=.76). Further, the range and variability of the
distribution of reliability coefficients was smallest for
Two-stage 2, indicating more consistency in reliability estimates
determined from the 100 samples. The obtained confidence
intervals indicate that the reliability of Two-stage 1 is
probably between .66 and .85, with an expected value of .76.
The reliability of Two-stage 2 is probably between .75 and .90
(expected value .83), and that of the conventional test pro-
bably falls in the interval between .72 and .87, with an
expected value of .80.

Relationships between Two-stage and Conventional Test Scores

Table 4 presents the linear (product-moment) correlations
and eta coefficients describing the relationship between scores
on each two-stage test and conventional test scores. All of
the coefficients were significantly different from zero (p<.001)
and indicate a high and predominantly linear relationship be-
tween scores obtained from the two methods of testing. Although
two of the eta coefficients indicated a significant degree of
curvilinearity, the absolute increase in the degree of
relationship with curvilinearity taken into account was very
small and not practically significant; with a sample size of
10,000 very small curvilinear trends may attain statistical
significance.

Two-stage 2 showed a higher degree of linear relationship
(r=.82) with the conventional test scores than did the original
two-stage test (r=.78 or .79) and thus accounted for an additional
6% (67% versus 61%) of the variance in the conventional test
scores. These values may be compared with those obtained in the
live-testing study of Two-stage 1 and the same conventional test
(Betz & Weiss, 1973), where the linear relationships between the
tests were r=.80 and r=.84 on test and retest, respectively, thus
accounting for 64% and 70% of the variance. These values compare
quite closely to the values obtained in the present study, and,
similarly, there was no evidence for important curvilinear trends
in the empirical data.

Relationships between Test Scores and Ability 

Table 5 presents the degree of linear and curvilinear
relationship between test scores and generated ability level
when calculated using all 10,000 scores obtained from each
testing strategy. All of the coefficients were significant at
p<.001 and, again, the relationships were high and predominantly
linear. Examination of the bivariate scatter plots did not show
clear curvilinear trends, and the eta coefficients do not add
importantly to the degree of linear relationship found.

                              Table 4

           Regression analysis of relationships between
        two-stage scores and conventional scores (N=10,000)

Test and Index of Relationship                 Time 1    Time 2

Two-stage 1 and conventional
  Product-moment correlation                     .79       .78
  Regression of two-stage scores
    on conventional scores (eta)                 .79       .78
  Regression of conventional scores
    on two-stage scores (eta)                    .79       .78*

Two-stage 2 and conventional
  Product-moment correlation                     .82       .82
  Regression of two-stage scores
    on conventional scores (eta)                 .82*      .82
  Regression of conventional scores
    on two-stage scores (eta)                    .82       .82

*Degree of curvilinearity significant at p<.001.

                              Table 5

           Regression analysis of relationships between
               test scores and ability (N=10,000)

Test and Index of Relationship                 Time 1    Time 2

Two-stage 1
  Product-moment correlation                     .87       .87
  Regression of two-stage scores
    on ability (eta)                             .87       .87
  Regression of ability on two-
    stage scores (eta)                           .87       .87*

Two-stage 2
  Product-moment correlation                     .91       .91
  Regression of two-stage scores
    on ability (eta)                             .91       .91
  Regression of ability on two-
    stage scores (eta)                           .91       .91

Conventional
  Product-moment correlation                     .90       .90
  Regression of conventional test
    scores on ability (eta)                      .90*      .90*
  Regression of ability on con-
    ventional test scores (eta)                  .90       .90*

*Curvilinearity statistically significant at p<.005.

Two-stage 2 showed the highest relationship to underlying
ability (r=.91), followed by the conventional test (r=.90)
and Two-stage 1 (r=.87). Thus, underlying ability level
accounted for approximately 83% of the variance in Two-stage 2
scores, 81% of the variance in conventional test scores, and
76% of the variance in scores on Two-stage 1.

Table 6 presents the characteristics of the sampling
distributions of the obtained and z-transformed product-moment
coefficients calculated on 100 groups of 100 testees. A compari-
son of the mean values shown in Table 6 with the values in Table
5, calculated only once for 10,000 testees, shows that they are
identical except for the conventional test, where the mean value
of 100 coefficients is .89 (Table 6) and the value for all
10,000 testees (Table 5) is .90.

Examination of the confidence intervals within which the
true population correlation (ρ) may be expected to fall shows
that the two methods of calculation, using the obtained
distribution of r or the distribution of transformed r's,
yield very similar results. The transformed coefficients
yield an interval of between .81 and .92 for the true relation-
ship between Two-stage 1 scores and generated ability, .86 to
.94 for Two-stage 2 scores and generated ability, and .85 to .93
for the relationship between scores on the conventional test
and underlying ability.

Information Functions 

Equal-frequency distribution. Table 7 presents the values
of the information function ($I_x(\theta)$) for the two-stage and
conventional tests at each of sixteen ability levels. The
value at each level represents the average of the values
obtained from the two administrations of each test; separate
values for the first and second administrations may be found
in Appendix Table D-1. These values may be compared directly
among tests and are equally reliable for each ability level.
Table 7 also presents the mean and standard deviation of the 16
values obtained for each test. The data contained in Table 7
are summarized in graphic form in Figure 2; the point values
have been connected and the curves visually smoothed to convey
the shape of the information functions for the three tests.
(The unsmoothed information functions for the three tests are
contained in Appendix E).

The shape of the information curve for the conventional
test, as shown in Figure 2, is very similar to that found in
Lord's (1971c) theoretical study; that is, the information values
are highest at the center of the ability distribution and drop
off sharply at the extremes. Both Lord's results, using "ideal"
items, and the results indicated here, using a set of items
with parameters that are typical of those occurring in
empirical test construction and which did not permit the con-
struction of a perfectly peaked conventional test, show that a
conventional test offers greatest precision of measurement for
individuals near the median ability level of the group and
decreasing precision with divergence of an individual's ability
from the median level.

                              Table 7

      Values of the information function ($I_x(\theta)$) for
      two-stage and conventional tests at points along
      the continuum of underlying ability (equal-frequency
      distribution of ability, N=200 at each level)

Level of
Ability (θ)     Two-stage 1     Two-stage 2     Conventional

   3.2             2.51         [illegible]     [illegible]
   3.0          [illegible]     [illegible]     [illegible]
   2.5          [illegible]     [illegible]     [illegible]
   2.0          [illegible]     [illegible]     [illegible]
   1.5             3.59         [illegible]     [illegible]
   1.0          [illegible]     [illegible]     [illegible]
    .5             3.12         [illegible]     [illegible]
    .1          [illegible]     [illegible]     [illegible]
   -.1             2.06         [illegible]     [illegible]
   -.5             3.60            4.91            4.25
  -1.0             3.86            5.22            3.53
  -1.5             3.66            4.76            3.01
  -2.0             3.19            2.58            2.50
  -2.5             2.43            2.94            1.32
  -3.0          [illegible]        1.13             .39
  -3.2             2.10             .79             .15

  Mean             3.06            3.72            2.71
  S.D.              .61            1.68            1.81

Figure 2 shows that Two-stage 1 provided more constant
information across ability levels than did either Two-stage 2
or the conventional test; it provided less information around
the median ability level but more information at the extremes.
The results for Two-stage 1 were similar to those found by Lord
(1971c) in his theoretical study of two-stage tests. However,
the results for the improved two-stage test were quite different
from those obtained in Lord's theoretical studies. The
information curve for Two-stage 2 was more similar in shape
to that of the conventional test, showing greatest precision in
the center of the ability distribution and a loss in precision at
the extremes. However, at every ability level, its information
values were higher than those of the conventional test.

The overall level and shape of the information functions
shown in Figure 2 are also reflected by the means and standard
deviations of the information values for each test, as shown in
Table 7. The average value for Two-stage 2 was 3.72, higher
than that for Two-stage 1 (3.06) and the conventional test (2.71).
The tendency of Two-stage 1 to yield a horizontal information
function rather than a peaked one, indicating more even or
constant precision of measurement, is reflected by the small
standard deviation of information values (.61) as compared to
that of Two-stage 2 (1.68) and the conventional test (1.81).

One way to interpret information values is in terms of the
relative numbers of items necessary to achieve equivalent
precision of measurement for a given individual. For example,
if for a specified level of ability, information for Test A is
twice as great as the value of information for Test B, it
indicates that Test B would require twice as many items as would
Test A to achieve the same level of precision of measurement.
Thus, the values shown in Table 7 indicate that at θ = 2.5, the
conventional test would require nearly three times as many items
to achieve the same level of precision as provided by Two-
stage 2 for individuals of that ability. At θ = [illegible], the
conventional test would require 39% more items, at θ = -1.0 it
would require 47% more, and at θ = -2.5, the conventional test
would require over twice the number of items.
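
The item-count interpretation is a simple ratio of information
values. As a worked illustration (the function name is hypothetical),
the sketch below applies it to the Two-stage 2 and conventional test
values that Table 7 gives at θ = -1.0 (5.22 and 3.53) and at
θ = -2.5 (2.94 and 1.32).

```python
def relative_test_length(info_a: float, info_b: float) -> float:
    """Factor by which Test B must be lengthened to match Test A's precision."""
    return info_a / info_b


# Two-stage 2 (Test A) versus the conventional test (Test B).
ratio_minus_1_0 = relative_test_length(5.22, 3.53)  # θ = -1.0
ratio_minus_2_5 = relative_test_length(2.94, 1.32)  # θ = -2.5

extra_items_percent = (ratio_minus_1_0 - 1.0) * 100.0
```

The first ratio is a little under 1.5, that is, just under 50% more
items, in line with the figure quoted for θ = -1.0; the second is
above 2, matching the statement that the conventional test would need
over twice as many items at θ = -2.5.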

Examination of the points at which the three curves shown
in Figure 2 intersect indicates comparative information or
precision for ranges of ability. Two-stage 1 and Two-stage 2
intersect at about θ = -2.0 and θ = +2.0; Two-stage 2 was superior
within this range, and Two-stage 1 was superior beyond it.
Two-stage 1 was superior to the conventional test when
θ > +1.3 and θ < -1.0, and Two-stage 2 was superior to the conven-
tional test at all levels of ability. Thus, of the three tests,
Two-stage 2 provided most precise measurement (least amount
of error) for testees whose abilities were between -2.0 and
+2.0 standard deviations from the population average, and Two-
stage 1 provided more accurate measurement for testees whose
ability was beyond this range. These data indicate that
at least one of the two-stage tests provided more accurate
measurement than the conventional test at all levels of
ability.

Normal ability distribution. Table 8 presents the values
of $I_x(\theta)$ provided by the two-stage and conventional tests under
the assumption of a normal distribution of ability; again,
these values represent the average of the values obtained from
the two administrations of each test (separate values for the
first and second administrations may be found in Appendix
Table D-2). Table 8 also indicates the total number of
"testees" upon which each value of $I_x(\theta)$ is based. For example,
only two "testees" were assigned an ability level between
θ = 3.1 and θ = 3.3 in the Two-stage 1 administration. Thus,
with two administrations of the test, the $I_x(\theta)$ value is based
on a total of four test administrations. Obviously, the
$I_x(\theta)$ values based on N's of 4, 1[illegible], or 30 at extreme ability
levels cannot be considered to be as representative of the true
information value at that ability level as may the $I_x(\theta)$ values
for abilities near the mean which were based on N's of 1,500 or
1,600. Again, the mean and standard deviation of the values
for each test are presented.

The results indicated in Table 8 are summarized graphically
in Figure 3, which shows the smoothed information curves for the
three tests. Appendix E (Figure [illegible]) shows the raw curves for
the normal distribution data.

Given the differences in the reliability of the values
determined from the normal and "equal-frequency" distributions
of ability, the results are remarkably similar. As shown in
Figure 3, Two-stage 1 again displays very constant information
along the ability continuum, while Two-stage 2 and the conven-
tional test show high levels of precision around the median
ability level but losses of precision at the extremes. Two-
stage 2 again, however, provides more information at all levels
of ability than does the conventional test.

The means and standard deviations of the information values
shown in Table 8 indicate that Two-stage 2 provided the highest
overall level of information (3.89), but that Two-stage 1
provided almost as high an average value (3.59). However,
Two-stage 1 had substantially less variability in the distribution
of obtained values (.96 as compared to 1.36 for Two-stage 2).
The conventional test provided the least amount of information
overall (2.86) and its values were the most variable (1.5[illegible]),
indicating least tendency toward constant precision across the
ability continuum.

                              Table 8

Values of the information function for two-stage and conventional
tests within intervals of the continuum of underlying ability
under the assumption of a normal distribution of ability (values
are the average of two administrations and are based on the
indicated total numbers of hypothetical individuals)

The intersections of the information curves in Figure 3
again show that Two-stage 2 provided more information than
Two-stage 1 within the interval θ = -2.0 to θ = +2.0, while
Two-stage 1 was superior beyond this interval. Two-stage 1
was superior to the conventional test at ability levels greater
than θ = +1.3 and less than θ = -1.3, while Two-stage 2 was
superior to the conventional test at essentially all levels of
ability.

In general, the information functions derived from both
the normal and "equal-frequency" distributions of ability show
that Two-stage 2 provided most precision for ability levels
within two standard deviations of the mean ability level, while
Two-stage 1 provided most precision outside that range.

CONCLUSIONS 



Both two-stage tests yielded score distributions which
better reflected the normal distribution of ability than did
the conventional test. However, the improved two-stage test
(Two-stage 2) was superior in this regard to both the original
two-stage test (Two-stage 1) and the conventional test. All
score distributions showed a significant degree of platykurtosis
and were thus flatter or more rectangular than the normal
distribution. This may be explained by the fact that the two-
stage test is designed to "spread" people out by concentrating
item difficulties at levels along the ability continuum appro-
priate to each individual's ability. The platykurtosis of
conventional test scores may be due to the fact that the test
was not perfectly peaked.

Two-stage 2 provided scores that were more reliable than
were scores obtained from the conventional test or from Two-
stage 1. However, all three reliability estimates were low,
ranging from .76 for Two-stage 1 to .83 for Two-stage 2. This
is perhaps due to the fact that the method of estimating re-
liability, the correlation between two parallel forms with no
time interval between administrations, includes fewer sources
of systematic variance which are included with the score variance
instead of with error variance than do most methods of esti-
mating reliability. For example, the reliability coefficients
obtained in the present study can be compared with the test-
retest stability coefficients of Two-stage 1 and the same
conventional test, as studied in Betz & Weiss (1973). The
stability of Two-stage 1 was .88 and that of the conventional
test was .89. While no stability data are yet available for
Two-stage 2, it is reasonable to infer that, given its higher
parallel forms reliability (which was actually determined through
re-administration of the same test), it would be substantially
more stable than either Two-stage 1 or the conventional test.
Thus, the correlations between scores obtained from two simulated
administrations of the same test are lower than those obtained
from the test-retest design with an interval of about five to
six weeks between administrations in the empirical study.

This result can be attributed to 1) the fact that "error"
in simulated test responses is entirely random and does not
contain some stable item-specific variance and 2) the absence
of memory effects. Effects of memory on test-retest stability
were found by Betz and Weiss (1973); the stability correlation
for individuals who had taken the same measurement test on
retesting (thus repeating all 40 items) was in the .90s, as opposed
to the value of .88 found for the group as a whole, many of whom
had taken a different measurement test on retesting. Similar
memory effects were found by Larkin and Weiss (1974a) in a
study comparing conventional tests and pyramidal adaptive tests.

Thus, the reliability values obtained in the present study
can be considered lower-bound estimates of the stability of
test results obtained from two administrations of the same test,
where knowledge that is stable but specific to particular item
content does not enter into the stability of obtained scores and
where the responses of an individual are not affected by
previous measurement of the ability (i.e., memory). It should be
noted that the obtained parallel forms reliability coefficients
are also lower-bound estimates of the internal consistency
reliability of the tests (Guilford, 1954; Stanley, 1971).

The relationship between two-stage and conventional test
scores was relatively high (.78 to .82) and primarily linear,
although Two-stage 2 showed the higher relationship to the
conventional test scores. These data indicate that although
a majority of variance is shared by the two testing strategies
(two-stage and conventional), some of the variance of either
strategy is left unaccounted for.
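
As a worked check, offered as an illustration rather than a figure taken from the original text, squaring the reported correlations gives the proportion of variance shared by the two strategies and, by subtraction, the proportion left unaccounted for:

```latex
\begin{align*}
r = .78:&\quad r^2 = .61, \qquad 1 - r^2 = .39 \\
r = .82:&\quad r^2 = .67, \qquad 1 - r^2 = .33
\end{align*}
```

On these values, roughly one-third to two-fifths of the variance of either strategy is not shared with the other.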

Ability estimates yielded by Two-stage 2 showed a higher
relationship to underlying ability (r=.91) than did ability
estimates yielded by Two-stage 1 (r=.87) or the conventional
test (r=.90 when based on the sample of 10,000 and r=.89 based
on the mean of the sampling distribution of 100 coefficients).
It is interesting to note that the reliability coefficients are
equal to the squares of the correlations between test scores
and underlying ability, which is the prediction yielded
by psychometric theory (Gulliksen, 1950). Thus, the square of
.91, the correlation between Two-stage 2 scores and ability,
is .83, the reliability of Two-stage 2. The reliability of the
conventional test (.80) is between .89² (=.79) and .90² (=.81),
and the reliability of Two-stage 1 (.76) is equal to the square
of its correlation with ability (.87² = .76).
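
The arithmetic behind this comparison can be checked directly. The following is a minimal sketch, not part of the original study; the dictionary layout and the function name are illustrative assumptions, and the numbers are the correlations and reliabilities reported in the text.

```python
# Correlations with underlying ability ("validity") and parallel-forms
# reliabilities as reported in the text. The conventional test has two
# reported correlations (.89 and .90), so it is given as a range.
reported = {
    "Two-stage 1": {"validity": (0.87,), "reliability": 0.76},
    "Two-stage 2": {"validity": (0.91,), "reliability": 0.83},
    "Conventional": {"validity": (0.89, 0.90), "reliability": 0.80},
}


def squared_validity_range(validities):
    """Return (min, max) of r**2 over the given correlations, to 2 places."""
    squares = [round(r * r, 2) for r in validities]
    return min(squares), max(squares)


for name, values in reported.items():
    low, high = squared_validity_range(values["validity"])
    print(f"{name}: r^2 between {low:.2f} and {high:.2f}, "
          f"reliability {values['reliability']:.2f}")
```

For each test the reported reliability falls on or between the squared correlations, which is the relationship the paragraph describes.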

The findings regarding the information or relative precision
of measurement at various points along the ability continuum
support the conclusion that two-stage testing strategies can
provide greater comparability of the precision of ability
estimates for individuals at all levels of ability represented
in a given population. Two-stage 1 yielded approximately
horizontal information functions, with ability estimates for
individuals whose ability levels fell three or more standard
deviations from the mean being nearly as precise as those for
individuals of average ability. Further, the average level
of information provided by Two-stage 1 was greater than that
provided by the conventional test and only somewhat less than
that provided by Two-stage 2, which yielded the highest level
of overall information. Two-stage 2, while showing a loss
in precision at the extremes, yielded more constant levels of
information than did the conventional test (as indicated by the
smaller standard deviation of information values).

The failure of Two-stage 2 to yield a horizontal information
function may be due to the strategy used in constructing the
test. The average difficulties of the Two-stage 2 measurement
tests were chosen to be closer to median ability than were those
in the Two-stage 1 measurement tests (see Table 1); this was done
in an attempt to maximize the appropriateness of each measurement
test for the group of individuals assigned to it. The attempt
was found to be successful in an empirical study of Two-stage 2
(Larkin & Weiss, 1974b). Thus, Two-stage 2 was composed of
measurement tests more appropriate for individuals near the
group mean and less appropriate for individuals whose abilities
were near the extremes. The result was a test which did not
have the approximately horizontal information function found by
Lord (1971c) in his theoretical studies, or for the Two-stage 1
test in this study. Rather, the Two-stage 2 information
function was more similar in shape to that of a conventional test,
but at a higher level. It would appear that for two-stage tests,
just as for peaked conventional tests, the advantages of
maximizing the appropriateness of item difficulty for a group
of individuals are offset somewhat by a loss in the precision
of measurement for individuals whose abilities are not near
the mean of the target group.

The finding that Two-stage 2 provided more precise
measurement than the conventional test must be interpreted with caution
because the average discriminating power of the items used in
constructing Two-stage 2 (mean a = .63, mean rb = .52) was slightly
higher than that for the conventional test items (mean a = .54,
mean rb = .47). The results of the present study contradict
Lord's findings from a variety of theoretical studies (Lord,
1970, 1971a,b,c) showing that a conventional test will always
provide more information for testees at the mean of the ability
distribution than will any adaptive test. But Lord's findings
were based on the use of hypothetical, ideal items which were
all of the same discriminating power; thus, the relative
discriminating power of the items did not influence the superiority
of any particular testing strategy. It will be necessary to
examine the information-providing characteristics of a
conventional test with items as discriminating as those used in
Two-stage 2 before it can be concluded that the two-stage test
can provide more accurate measurement around the mean of the ability
distribution. However, the superiority of Two-stage 1 to the
conventional test with increasing divergence from the mean
ability level and its higher overall level of precision of
measurement cannot be attributed to differential item
discriminating power (the items in Two-stage 1 had a mean a = .55 and
a mean rb almost identical to that of the conventional
test) but is instead attributable to the process of adapting
item difficulties to the characteristics of each individual
testee.



The results of the simulation studies described here
reflected quite closely the results of the parallel empirical
study (Betz & Weiss, 1973) with regard to characteristics of
the score distributions and the degree and nature of the
relationship between two-stage and conventional test scores.
The correspondence of the reliability coefficients to the
squared correlations between test scores and underlying ability
and the similarity of the conventional test and Two-stage 1
information functions to those found in Lord's theoretical studies
using restrictive assumptions of "ideal" items are further
evidence as to the validity and utility of the simulation model
used in this study. Thus, it is concluded that further simulation
studies both paralleling and extending the on-going empirical
research will be useful in exploring the measurement
characteristics of variations of the two-stage testing strategy. For
example, most studies of two-stage testing to date have used
an even number (usually 4) of measurement tests; in the present
study there were four, two at difficulty levels above the mean
and two at difficulty levels below the mean. Thus, individuals
at the mean ability level for whom the routing test (or any
conventional test peaked at the mean ability level) is most
appropriate are routed up or down into a somewhat less
appropriate measurement test. Using an odd number of measurement
other; ^rrji^t^i^/r'^^' '""^ ^^^^^^^ ^^-^ and ?he ' 

dl^cr^^tia^^Ln^em::^" conventional test scopes giv:^ e^^lly 



Another approach to improving two-stage testing procedures
would involve using more measurement tests with fewer items.
However, the narrower the range of routing test scores used
to assign individuals to measurement tests, the greater the
likelihood that small errors in the estimation of an individual's
ability from routing test scores will lead to mis-routing, or
routing to an inappropriate measurement test. The possibility
of routing errors is probably the major disadvantage of two-stage
testing strategies as they are currently being studied, and
significant improvements in the procedure would probably result
if individuals who had been mis-routed were identified early
in the administration of the measurement test and re-routed to
a more appropriate test. A recovery routine of this type could
easily be accommodated into the computer administration of
two-stage tests and seems to be a necessary and fruitful
direction for further investigations of the strategy.

The results of the present study also help clarify criteria
which can be used to compare adaptive and conventional strategies.
Since the reliability coefficients were shown to be a
transformation of the correlation of test scores and ability, they
are appropriate criteria for comparison of strategies. However,
both reliability coefficients and ability-test score correlations
showed only small differences between the strategies. Information
functions, on the other hand, showed considerable gains in
precision for the adaptive strategy in regions of the ability
distribution. When it is not possible to compute information
functions, such as in a live-testing study, the present results
suggest that differences in reliability coefficients might
parallel similar differences in average level of the information
functions.

Summary 

The improvements made in the construction of Two-stage 2
were reflected by results showing that, in comparison to both
Two-stage 1 and the conventional test, scores yielded by
Two-stage 2 better reflected the underlying normal distribution of
ability, were more reliable, and had a higher relationship to
underlying ability. However, although the overall level of
information provided by Two-stage 2 exceeded that of the
conventional test at all ability levels and that of Two-stage 1
at ability levels within two standard deviations of the mean,
it failed to yield the horizontal information function that was
predicted and was found for Two-stage 1. Further research is
needed to determine the conditions under which two-stage tests
will yield horizontal information functions whose values equal
or exceed those of conventional tests even at average levels of
ability.
Table A-3

Item Parameters for the Conventional Test

Item Reference
Number            Difficulty (b)    Discrimination (a)

  58                  -.96                .48
 221                  -.74                .65
 307                  -.84                .56
 393                  -.95                .49
 211                  -.72                .61
 224                  -.78                .54
 390                  -.73                .63
 667                  -.73                . .7
 156                  -.63                .65
 208                  -.68                .58
 234                  -.69                .51
  52                  -.28                .61
 137                  -.74                .40
 176                  -.90                .34
 207                  -.53                .60
 218                  -.93                .33
 205                  -.62                .47
 382                  -.48                .64
 391                  -.53                .48
 626                  -.29                .65
 643                  -.32                •
 661                  -.30                .58
 670                  -.28                .62
 :27                  -.25                .57
  50                  -.23                .50
 144                  -.18                .63
 369                  -.22                .56
 233                  -.17                .47
 636                  -.15                .54
 633                  -.08                .50
 146                   .00                .61
 295                  -.04                .47
 113                   .25                .61
 267                   .19                .44
  59                   .17                .64
 271                   .33                .53
 302                   .37                .50
 375                   .46                .49
 666                   .42                .55
 651                   .4?                .56

Mean                  -.33                .54
S. D.                  .43                .08
Appendix B

Routing test scores and corresponding initial
ability estimates used in the assignment of
testees to Two-stage 2 measurement tests.

Routing Test Score    Ability Estimate     Mean Difficulty Level of
(number correct)      (standard scores)    Assigned Measurement Test

     2.5*                 -2.45               -1.6  (Test 4)
     3                    -1.90               -1.6  (Test 4)
     4                    -1.20               -1.6  (Test 4)
     5                     -.69               -.71  (Test 3)
     6                     -.23               -.71  (Test 3)
     7                      .23                .35  (Test 2)
     8                      .75                .35  (Test 2)
     9                     1.44               1.73  (Test 1)
     9.5*                  1.99               1.73  (Test 1)

*Ability estimates are infinite for perfect scores (10 correct) or for
scores at or below chance level (≤ 2 correct).
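
The routing rule in this table can be expressed as a small lookup. The sketch below is illustrative only: the names `ROUTING_TABLE` and `route` are hypothetical, and it assumes that the starred entries mean scores of 2 or fewer were treated as 2.5 and a perfect score of 10 as 9.5, which is one reading of the footnote rather than something the appendix states outright.

```python
# Appendix B as a lookup: routing score -> (initial ability estimate,
# assigned measurement test number, mean difficulty of that test).
ROUTING_TABLE = {
    2.5: (-2.45, 4, -1.60),
    3: (-1.90, 4, -1.60),
    4: (-1.20, 4, -1.60),
    5: (-0.69, 3, -0.71),
    6: (-0.23, 3, -0.71),
    7: (0.23, 2, 0.35),
    8: (0.75, 2, 0.35),
    9: (1.44, 1, 1.73),
    9.5: (1.99, 1, 1.73),
}


def route(number_correct):
    """Map a 10-item routing score to its Appendix B row.

    Assumption (not stated explicitly in the source): scores at or below
    chance (<= 2) use the 2.5 row and a perfect score (10) uses the 9.5
    row, since the ability estimate is otherwise infinite.
    """
    if not 0 <= number_correct <= 10:
        raise ValueError("routing score must be between 0 and 10")
    if number_correct <= 2:
        key = 2.5
    elif number_correct == 10:
        key = 9.5
    else:
        key = number_correct
    return ROUTING_TABLE[key]


if __name__ == "__main__":
    for score in range(11):
        estimate, test, difficulty = route(score)
        print(score, estimate, f"Test {test}", difficulty)
```

Because each measurement test covers only two or three adjacent routing scores, a one-item error on the routing test near a boundary (for example 4 versus 5) changes the assigned measurement test, which is the mis-routing risk discussed in the conclusions.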
Appendix C

Description of the algorithm for SIMTEST,
the computer program controlling
simulated test administration.



Program SIMTEST is written to generate hypothetical ability and four
test scores for each of 100 "testees" on each run of the program.

Program SIMTEST, written in FORTRAN for a Control Data Corporation
6400 computer, runs in a time-shared mode and proceeds as follows:

1. Normal ogive difficulty (b) and discrimination (a) parameters
are read for each item in the item pool.

2. An initialization value ("seed") is read for the random number
generator.

3. Using the "seed," 100 ability levels are generated from a
theoretical normal distribution using Subroutine NORMAL (a University
of Minnesota Computer Center systems subroutine). This
subroutine is a pseudo-random generator of real numbers from a
normal distribution with mean 0 and variance 1.

4. Subroutine RAN2F (University of Minnesota Computer Center) is
used to generate 160 random numbers from a rectangular
distribution of real numbers between 0 and 1. These 160 numbers are
stored for use in Subroutine ITEMSYM (Step 7).

5. The ability level of the hypothetical subject is sent to one of
the testing subroutines (Two-stage 1, Two-stage 2, or conventional),
where the determination of the first or next item to be
administered is made.

6. In Subroutine ITEMSYM, the parameters of that item, a_i, b_i, and
c_i (the guessing parameter, set at .2), and the individual's
ability level θ are entered into the following equation:

$$P_i(\theta) = c_i + (1 - c_i)\,\Phi[a_i(\theta - b_i)]$$

where Φ denotes the cumulative normal distribution function.
The result, P_i(θ), is the probability that a person with ability θ
will answer item i correctly.

7. In Subroutine ITEMSYM, P_i(θ) is compared to the random number
generated in Step 4 which corresponds to the order of administration
of item i. Thus, the value of the random number p_i is compared to
the probability P_i(θ) that the individual will answer item i correctly.
If P_i(θ) > p_i, the item is scored "correct."
If P_i(θ) < p_i, the item is scored "incorrect."

The values of p_i and P_i(θ) occur with sufficient places to the right
of the decimal point that the chance of p_i = P_i(θ) is extremely small
and, in fact, has not occurred.

8. The dichotomized item response is returned to the testing program,
which stores it and may use it to determine the next item to be
administered. After the administration of each 40-item test, the
total score for that test is calculated for that "testee."

In order to generate a rectangular distribution of underlying ability,
the procedure described in Step 3 was replaced by a procedure in which a
particular level of ability was read in and used as the underlying ability
level of all 100 "testees" simulated in that run.
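
The core of Steps 3 through 8 can be sketched in a few lines. This is a minimal Python illustration, not the original FORTRAN program: the function names are hypothetical, Python's `random` module stands in for Subroutines NORMAL and RAN2F, the uniform numbers are drawn as needed rather than stored in advance, and only a fixed (conventional) test is simulated, so the adaptive item selection of Step 5 is omitted.

```python
import math
import random

GUESSING = 0.2  # c_i, fixed at .2 as stated in Step 6


def normal_cdf(x):
    """Cumulative standard normal distribution (the normal ogive)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def prob_correct(theta, a, b, c=GUESSING):
    """Step 6: P_i(theta) = c + (1 - c) * Phi[a * (theta - b)]."""
    return c + (1.0 - c) * normal_cdf(a * (theta - b))


def score_item(p_correct, u):
    """Step 7: score 1 (correct) if P_i(theta) exceeds the uniform draw u."""
    return 1 if p_correct > u else 0


def simulate_fixed_test(theta, items, rng):
    """Administer every (a, b) item in order and return the number correct."""
    return sum(
        score_item(prob_correct(theta, a, b), rng.random()) for a, b in items
    )


if __name__ == "__main__":
    rng = random.Random(12345)  # arbitrary seed, playing the role of Step 2
    # First three items of Table A-3, given as (a, b) pairs.
    items = [(0.48, -0.96), (0.65, -0.74), (0.56, -0.84)]
    abilities = [rng.gauss(0.0, 1.0) for _ in range(100)]  # Step 3
    scores = [simulate_fixed_test(theta, items, rng) for theta in abilities]
    print("mean number correct:", sum(scores) / len(scores))
```

Replacing the `abilities` list with 100 copies of a single value corresponds to the "equal-frequency" variant described in the final paragraph of this appendix.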






Appendix D

Unaveraged values of the information function for two-stage
and conventional tests, from Time 1 and Time 2 administrations.



Table D-1

Information values for Time 1 and Time 2 administrations of
two-stage and conventional tests ("equal-frequency"
distribution, N = 100 at each level of θ)

                           Information (I_x(θ))

             Two-stage 1        Two-stage 2        Conventional
Ability    Time 1   Time 2    Time 1   Time 2    Time 1   Time 2

  3.2       1.88     3.14      1.37     2.68      1.06      .65
  3.0       2.44     2.41      1.61     1.85       .03      .01
  2.5       2.82     4.24      3.47     3.46      1.07     1.18
  2.0       4.00     4.18      3.34     4.62      3.37     3.22
  1.5       3.33     3.85      2.92     5.84      3.86     4.90
  1.0       3.16     2.74      4.83     5.08      4.81     4.03
   .5       3.46     2.78      5.07     6.36      6.07     4.53
   .1       2.87     2.46      6.93     5.51      3.96     4.96
  -.1       3.09     2.62      5.36     4.14      4.31     4.45
  -.5       3.95     3.25      4.07     5.75      4.88     3.62
 -1.0       3.85     3.87      5.35     5.09      3.71     3.34
 -1.5       2.63     4.70      3.16     5.23      3.51     2.51
 -2.0       2.64     3.73      2.81     2.35      2.33     2.66
 -2.5       2.60     2.26      2.21     3.66      1.38     1.25
 -3.0       1.75     3.07       .75     1.51       .28      .51
 -3.2       1.64     2.55       .58     1.00       .08      .22

Mean        2.88     3.24      3.38     4.01      2.79     2.63
S. D.        .74      .76      1.82     1.70      1.92     1.76
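
The Mean and S. D. rows of Table D-1 can be recomputed from the tabled values, which is a useful check on a transcription of this kind. The sketch below is an illustration added here, not part of the original report; the variable and function names are hypothetical, and only the two Time 2 columns shown are transcribed.

```python
import statistics

# Time 2 information values from Table D-1, listed from ability 3.2
# down to ability -3.2.
two_stage_1_time_2 = [3.14, 2.41, 4.24, 4.18, 3.85, 2.74, 2.78, 2.46,
                      2.62, 3.25, 3.87, 4.70, 3.73, 2.26, 3.07, 2.55]
conventional_time_2 = [0.65, 0.01, 1.18, 3.22, 4.90, 4.03, 4.53, 4.96,
                       4.45, 3.62, 3.34, 2.51, 2.66, 1.25, 0.51, 0.22]


def summarize(values):
    """Return (mean, sample standard deviation), each rounded to 2 places."""
    return round(statistics.mean(values), 2), round(statistics.stdev(values), 2)


if __name__ == "__main__":
    print("Two-stage 1, Time 2:", summarize(two_stage_1_time_2))
    print("Conventional, Time 2:", summarize(conventional_time_2))
```

For the Two-stage 1 Time 2 column this reproduces the tabled mean of 3.24 and S. D. of .76 when the standard deviation is computed with N - 1 in the denominator. The smaller standard deviation for Two-stage 1 is what the conclusions refer to as a more constant level of information across ability levels.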
Table D-2



Information values for Time 1 and Time 2 administrations of two-stage
and conventional tests (normal distribution of underlying ability,
total N = 10,000)



Interval of Ability



Information (I_x(θ))



Two-stage 1
Time 1 Time 2



Two-stage 2
Time 1 Time 2



Conventional
Time 1 Time 2



3.1 to 3.3
2.9 to 3.1
2.7 to 2.9
2.5 to 2.7
2.3 to 2.5
2.1 to 2.3
1.9 to 2.1
1.7 to 1.9
1.5 to 1.7
1.3 to 1.5
1.1 to 1.3
.9 to 1.1
.7 to .9
.5 to .7
.3 to .5
.1 to .3
-.1 to .1
-.3 to -.1
-.5 to -.3
-.7 to -.5
-.9 to -.7
-1.1 to -.9
-1.3 to -1.1
-1.5 to -1.3
-1.7 to -1.5
-1.9 to -1.7
-2.1 to -1.9
-2.3 to -2.1
-2.5 to -2.3
-2.7 to -2.5
-2.9 to -2.7
-3.1 to -2.9
-3.3 to -3.1

Mean
S. D.



2.94 

3.06 

3.64 

2.23 

3.08 

3.57 

2.64 

4.00 

3.49 

3.40 

3.60 

3.49 

3.04 

3.23 

2.88 

2.55 

3.14 

2.80 

3.18 

3.43 

3.07 

3.12 

3.10 

2.97 

3.32 

2.99 

3.03 

3.71 

2.39 

3.21 

3.13 

2.37 

4.09 

3.15 
.43 



6.08 
13.65* 
4.92 
2.35 
3.98 
4.23 
3.64 
3.95 
3.95 
3.42 
3.93 
3.52 
3.59 
2.94 
2.94 
3.04 
3.01 
3.21 
2.98 
3.71 
3.33 
3.32 
3.71 
4.02 
3.82 
3.85 
3.53 
4.09 
3.56 
3.74 
4.68 
5.11 
3.18 

4.03 
1.87 



1.12 
1.46 
2.22 
4.73 
4.48 
2.98 
3.46 
3.04 
3.96 
4.21 
5.12 
4.65 
4.66 
4.13 
4.60 
5.07 
4.87 
5.27 
5.52 
5.48 
5.09 
5.09 
4.51 
3.79 
4.17 
4.30 
3.09 
2.41 
1.93 
3.59 
1.44 
.32 
.62 

3.68 
1.48 



.51 
1.58 
3.16
3.10 
2.74 
4.46 
4.03 
4.38 
3.94 
5.19 
5.26 
5.00 
4.95 
4.27 
4.90 
5.02 
5.74 
5.94 
5.50 
5.82 
5.63 
5.44 
5.03 
4.70 
4.61 
3.71 
3.54 
3.17 
1.87 
1.86 
1.26 
2.07 
7.00 

4.10 
1.55 



.58 
.04 
.12
.43 
1.93 
2.02 
4.23 
3.48 
3.46 
3.97 
4.05 
4.36 
4.27 
4.54 
4.78 
4.58 
5.18 
4.37 
4.26 
4.28 
3.87 
3.64 
3.78 
3.85 
3.39 
2.98 
2.97 
2.57 
1.46 
3.10 
1.26 
.71 
.29 

2.99 
1.56 



.42 
.01 
.41 
1.40 
1.85 
2.73 
3.99 
3.64 
4.54 
4.35 
5.19 
4.45 
4.29 
4.96 
4.53 
4.39 
4.51 
4.24 
3.63 
3.70 
3.33 
3.33 
2.87 
2.80 
2.50 
2.11 
1.40 
1.50 
.94 
.33 
.98 
.01 

2.79 
1.63 



*Score variance was extremely small; deleting this value results in a mean 
of 3.73 and a variance of .72. 

**Value was infinite because there was no variance (the two scores falling 
in this interval were equal). 